Goto

Collaborating Authors

 North Karelia





k-Means Clustering with Distance-Based Privacy

Neural Information Processing Systems

In this paper, we initiate the study of Euclidean clustering with Distance-based privacy. Distance-based privacy is motivated by the fact that it is often only needed to protect the privacy of exact, rather than approximate, locations.


K-Medoids For K-Means Seeding

Neural Information Processing Systems

We show experimentally that the algorithm clarans of Ng and Han (1994) finds better K -medoids solutions than the V oronoi iteration algorithm of Hastie et al. (2001). This finding, along with the similarity between the V oronoi iteration algorithm and Lloyd's K -means algorithm, motivates us to use clarans as a K -means initializer. We show that clarans outperforms other algorithms on 23/23 datasets with a mean decrease over k-means-++ (Arthur and V assilvitskii, 2007) of 30% for initialization mean squared error (MSE) and 3% for final MSE. We introduce algorithmic improvements to clarans which improve its complexity and runtime, making it a viable initialization scheme for large datasets.




k-Means Clustering with Distance-Based Privacy

Neural Information Processing Systems

In this paper, we initiate the study of Euclidean clustering with Distance-based privacy. Distance-based privacy is motivated by the fact that it is often only needed to protect the privacy of exact, rather than approximate, locations.


Enhanced High-Dimensional Data Visualization through Adaptive Multi-Scale Manifold Embedding

arXiv.org Artificial Intelligence

To address the dual challenges of the curse of dimensionality and the difficulty in separating intra-cluster and inter-cluster structures in high-dimensional manifold embedding, we proposes an Adaptive Multi-Scale Manifold Embedding (AMSME) algorithm. By introducing ordinal distance to replace traditional Euclidean distances, we theoretically demonstrate that ordinal distance overcomes the constraints of the curse of dimensionality in high-dimensional spaces, effectively distinguishing heterogeneous samples. We design an adaptive neighborhood adjustment method to construct similarity graphs that simultaneously balance intra-cluster compactness and inter-cluster separability. Furthermore, we develop a two-stage embedding framework: the first stage achieves preliminary cluster separation while preserving connectivity between structurally similar clusters via the similarity graph, and the second stage enhances inter-cluster separation through a label-driven distance reweighting. Experimental results demonstrate that AMSME significantly preserves intra-cluster topological structures and improves inter-cluster separation on real-world datasets. Additionally, leveraging its multi-resolution analysis capability, AMSME discovers novel neuronal subtypes in the mouse lumbar dorsal root ganglion scRNA-seq dataset, with marker gene analysis revealing their distinct biological roles.


Manual Labelling Artificially Inflates Deep Learning-Based Segmentation Performance on RGB Images of Closed Canopy: Validation Using TLS

arXiv.org Artificial Intelligence

Monitoring forest dynamics at an individual tree scale is essential for accurately assessing ecosystem responses to climate change, yet traditional methods relying on field-based forest inventories are labor-intensive and limited in spatial coverage. Advances in remote sensing using drone-acquired RGB imagery combined with deep learning models have promised precise individual tree crown (ITC) segmentation; however, existing methods are frequently validated against human-annotated images, lacking rigorous independent ground truth. In this study, we generate high-fidelity validation labels from co-located Terrestrial Laser Scanning (TLS) data for drone imagery of mixed unmanaged boreal and Mediterranean forests. We evaluate the performance of two widely used deep learning ITC segmentation models - DeepForest (RetinaNet) and Detectree2 (Mask R-CNN) - on these data, and compare to performance on further Mediterranean forest data labelled manually. When validated against TLS-derived ground truth from Mediterranean forests, model performance decreased significantly compared to assessment based on hand-labelled from an ecologically similar site (AP50: 0.094 vs. 0.670). Restricting evaluation to only canopy trees shrank this gap considerably (Canopy AP50: 0.365), although performance was still far lower than on similar hand-labelled data. Models also performed poorly on boreal forest data (AP50: 0.142), although again increasing when evaluated on canopy trees only (Canopy AP50: 0.308). Both models showed very poor localisation accuracy at stricter IoU thresholds, even when restricted to canopy trees (Max AP75: 0.051). Similar results have been observed in studies using aerial LiDAR data, suggesting fundamental limitations in aerial-based segmentation approaches in closed canopy forests.